Minchiuan, May 2023
- Why we need reinforcement learning
+ 1.1. needs from real cases
+ 1.2. RL's differences from supervised learning
- Markov Decision Process
- First Step Monte Carlo Methods
- Temporal-Difference Learning
- Deep Reinforcement Learning
- Deep Q Learning
- Policy Gradient
- Actor-Critic
- Trust Region Policy Optimization (TRPO)
- Proximal Policy Optimization (PPO)
In the previous lesson, we learned about the Generative Pretrained Transformer (GPT). With this method, we can map different words to different vectors explicitly.
In fact, the GPT model is trained to maximize $\Pr(w_n \mid w_1, w_2, \dots, w_{n-1}; \theta)$ over $\theta$; at generation time it picks $\arg\max_{w_n} \Pr(w_n \mid w_1, w_2, \dots, w_{n-1}; \theta)$.
Example 1
Given two words $w_1, w_2$ (今天 晚上, "tonight"), which of the candidate continuations has the higher probability? By the chain rule, a sentence's probability factorizes as
$$ \Pr(w_2 \mid w_1)\,\Pr(w_x \mid w_1, w_2)\,\Pr(w_y \mid w_1, w_2, w_x) $$
If sentence 1's probability is bigger than sentence 2's, then, given $w_1, w_2$ as "今天 晚上", the model will generate $w_x: 吃$ ("eat").
But for a generation task, sentence 2 may be the better output.
Therefore, we need a model that can produce results other than what it would initially generate.
However, in order to generate more readable sentences, how do we let the model know whether its result is good or not?
Mathematically, in the previous training task, we defined a model that generates the next token from many labeled examples.
$$ \text{dataset} = \begin{bmatrix} x: (w_i, w_{i+1}, \dots, w_{i+k}), y: w_{i+k+1} \\ x: (w_j, w_{j+1}, \dots, w_{j+k}), y: w_{j+k+1} \\ x: (w_m, w_{m+1}, \dots, w_{m+k}), y: w_{m+k+1} \\ x: (w_n, w_{n+1}, \dots, w_{n+k}), y: w_{n+k+1} \\ \end{bmatrix}$$By defining a loss such as cross entropy, we can train a model to generate tokens.
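The dataset construction and loss above can be sketched in a few lines. This is only an illustration: the toy `corpus`, the context length `k`, and the uniform "model" distribution are assumptions, not part of any real training pipeline.

```python
import math

# toy corpus of token ids; in practice these come from a tokenizer
corpus = [3, 7, 1, 4, 9, 2, 8, 5]
k = 3  # context window length (assumed)

# build (x, y) pairs: a k-token context predicts the next token
dataset = [(corpus[i:i + k], corpus[i + k]) for i in range(len(corpus) - k)]

def cross_entropy(probs, target):
    # negative log-likelihood of the target token under the
    # model's predicted distribution `probs`
    return -math.log(probs[target])

# a uniform "model" over 10 token ids, just to show the loss computation
uniform = [0.1] * 10
loss = sum(cross_entropy(uniform, y) for _, y in dataset) / len(dataset)
```

A real model would replace `uniform` with the softmax output of the network, and minimizing `loss` over the parameters recovers the training objective above.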
However, as the analysis of Example 1 shows, the goal is not just to make the best choice at each single step; we need the best accumulated result.
That is, at some steps we may not choose the best next action, but from a long-term view we obtain a better overall result.
If we formalize our problem, it looks like this: $$ \mathbf{Tokens_{gen}} = \begin{pmatrix} step_i, & f(x_{i \dots i+n}; \theta) \rightarrow token_{i+n+1} \\ step_{i+1}, & f(x_{i+1 \dots i+n+1}; \theta) \rightarrow token_{i+n+2} \\ \vdots & \vdots \\ step_{i+k}, & f(x_{i+k \dots i+n+k}; \theta) \rightarrow token_{i+n+k+1} \end{pmatrix}$$
We need the accumulated result of $token_{i+n+1}, token_{i+n+2}, \dots, token_{i+n+k+1}$ to be the best.
Therefore, in order to evaluate each step's performance and the final result, we attach a reward to each step:
$$ \mathbf{Tokens_{gen}} = \begin{pmatrix} step_i, & f(x_{i \dots i+n}; \theta) \rightarrow token_{i+n+1}, & Reward_i \\ step_{i+1}, & f(x_{i+1 \dots i+n+1}; \theta) \rightarrow token_{i+n+2}, & Reward_{i+1} \\ \vdots & \vdots & \vdots \\ step_{i+k}, & f(x_{i+k \dots i+n+k}; \theta) \rightarrow token_{i+n+k+1}, & Reward_{i+k} \end{pmatrix}$$Mathematically, we are updating $\theta$ to get the biggest accumulated reward.
$$ \arg\max_{\theta} \sum_{i} Reward_i $$This objective has several characteristics:
- long-term objective, sequential-series learning
- decision making
- self-play learning
We call this our learning objective.
Some more concepts:
agent: the entity that makes decisions; in AI algorithms, it usually means the model being trained;
(multi-agents: some tasks that need more than one agent to achieve the task, MARL)
state: the representation of what the agent currently perceives.
At each step $i$: $S_i$; next state: $S_{i+1}$ or $S'$; $\mathcal{S}$ is the collection of all possible states, $S_i \in \mathcal{S}$
terminal state: the final state, $ S_T $
environment: everything outside of the model or agent; the environment returns the reward and the next state. Sometimes denoted $\mathbf{Model}$.
action: based on the current "state", the decision that the agent makes. $a \in \mathcal{A}$
reward: a numerical signal determined by the current state and action, $Reward(s, a)$; also written $r$, $R$, $reward$
transition probability: $$ \Pr(S_{t+1} \mid S_t, a_t ; \theta) $$
value: in a state, the expectation of the sum of future rewards, $$V(s_i)$$
In the simplest situation, $$ V(s_i) = R(s_i) + R(s_{i+1}) + R(s_{i+2}) + \dots + R(s_{T}), \\ f(s_i) = a_i, \quad environment(s_i, a_i) \rightarrow s_{i+1} $$
Therefore, our learning object is to maximize the $ \mathbf{Value}(S_0)$.
To make our model learn more efficiently, we let it weight the current reward a little more than future rewards.
Value with discount: $$ V(s_i) = R(s_i) + \gamma R(s_{i+1}) + \gamma^2 R(s_{i+2}) + \dots + \gamma^{T-i} R(s_{T}), \quad \gamma \in (0, 1)$$
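The discounted sum above can be computed with a single backward pass over the rewards. A minimal sketch; the reward list and `gamma` values used below are illustrative:

```python
def discounted_value(rewards, gamma=0.99):
    # V(s_i) = R(s_i) + gamma*R(s_{i+1}) + gamma^2*R(s_{i+2}) + ...
    # accumulated from the last reward backwards
    value = 0.0
    for r in reversed(rewards):
        value = r + gamma * value
    return value

# three steps of reward 1 with gamma = 0.5:
# 1 + 0.5*1 + 0.25*1
discounted_value([1, 1, 1], gamma=0.5)  # -> 1.75
```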
This gives RL its dynamic-programming character: the Bellman equation.
But in real situations, transitions are stochastic: $$ \Pr(s_{t+1} \mid s_t, a_t) \neq 1$$
$$ V(s_t) = \sum_{a \in \mathcal{A}} \pi(a \mid s_t) \Big( R(s_t, a) + \gamma \sum_{s_{t+1} \in \mathcal{S}} \Pr(s_{t+1} \mid s_t, a)\, V(s_{t+1}) \Big) $$Assuming there exists a function $v(s; \theta) \rightarrow \mathbb{R}$, we want to find the function's parameters $\theta$.
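On a small toy MDP (the states, transition table, rewards, and uniform policy below are made up for illustration), the Bellman expectation equation can be applied iteratively until $V$ stops changing:

```python
import numpy as np

# toy MDP (assumed for illustration): 2 states, 2 actions
n_states, n_actions, gamma = 2, 2, 0.9
P = np.zeros((n_states, n_actions, n_states))  # P[s, a, s'] = Pr(s' | s, a)
P[0, 0] = [0.8, 0.2]; P[0, 1] = [0.1, 0.9]
P[1, 0] = [0.5, 0.5]; P[1, 1] = [0.0, 1.0]
R = np.array([[1.0, 0.0], [0.0, 2.0]])        # R[s, a]
pi = np.full((n_states, n_actions), 0.5)       # uniform policy pi(a | s)

# repeatedly apply the Bellman expectation backup (policy evaluation)
V = np.zeros(n_states)
for _ in range(1000):
    V_new = np.array([
        sum(pi[s, a] * (R[s, a] + gamma * P[s, a] @ V)
            for a in range(n_actions))
        for s in range(n_states)
    ])
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new
```

Because the backup is a contraction for $\gamma < 1$, the iteration converges to the unique fixed point of the equation above.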
The most straightforward way: to estimate $v(s)$, roll out the whole episode and collect all the future $reward_i$ (Monte Carlo).
A smarter way: use a single transition $s_t, a_t, s_{t+1}, r_t$ (temporal difference).
$$ v_{new}(s_t) = r_t + \gamma\, v(s_{t+1}; \theta) $$$$ \mathcal{L}(\theta) = \big(v_{new}(s_t) - v(s_t; \theta)\big)^2$$Therefore, our target in reinforcement learning is to find policies that maximize $Value(S_0), S_0 \in \mathcal{S}$
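The "smarter way" above is the temporal-difference (TD) update. A minimal tabular sketch, where the observed transition `('s0', 1.0, 's1')` and the step size `alpha` are assumed for illustration:

```python
from collections import defaultdict

V = defaultdict(float)  # tabular v(s), zero-initialized
gamma, alpha = 0.99, 0.1

def td_update(s_t, r_t, s_next):
    # v_new(s_t) = r_t + gamma * v(s_next) is the TD target;
    # move v(s_t) toward it, reducing (v_new(s_t) - v(s_t))^2
    target = r_t + gamma * V[s_next]
    V[s_t] += alpha * (target - V[s_t])

# V['s1'] is still 0, so the target is 1.0 and
# V['s0'] moves from 0.0 toward it by a factor alpha
td_update('s0', 1.0, 's1')
```

With a parametric $v(s; \theta)$, the same target would be used as the regression label in the squared loss above.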
Q-value: the value of a specific state ($s_i$) and action ($a_i$):
$$ Q(s_i, a_i) = R(s_i, a_i) + \gamma \sum_{s_{i+1} \in \mathcal{S}} \Pr(s_{i+1} \mid s_i, a_i)\, V(s_{i+1})$$$$ Value(s_i) = \sum_{a \in \mathcal{A}} Q(s_i, a)\, \pi(a \mid s_i)$$Therefore, we can learn how to make a decision at each step by finding the action with the biggest Q-value.
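The relation between $Value(s_i)$, the Q-values, and the greedy action can be checked numerically; the Q-values and policy probabilities below are made-up numbers for one state:

```python
import numpy as np

# assumed Q-values for one state s_i over three actions,
# and an assumed policy pi(a | s_i)
q = np.array([1.0, 3.0, 2.0])
pi = np.array([0.2, 0.5, 0.3])

# Value(s_i) = sum_a Q(s_i, a) * pi(a | s_i)
value = float(q @ pi)

# decision making: pick the action with the biggest Q-value
greedy_action = int(np.argmax(q))
```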
With $Q(s, a; \theta)$, some Q-values may be very high; the agent/model believes the action is very good, so the parameters are updated in that direction with a big step.
But $V(s)$ may be a very big number too: a high Q-value may only reflect that the state itself is good, not that the action is better than the alternatives.
Advantage value:
$$ A(s, a) = Q(s, a) - V(s) $$Imagine the following two situations:
For a driving car problem, the state is the vision of the surroundings someone has seen plus the car's monitor information, such as engine RPM, gas amount, etc.

$$ reward(state_t) = -1 $$
We are training a model that maximizes $\sum_i R_i$.
the next state depends only on the current state and the environment's random/stochastic variables
But if this car suddenly remembers that there is a place it should have visited but did not,
then the car will change its route and visit that place first.
In this situation, the next state does not only depend on the current state, but also on memory.
A sequence of events or states is Markov if the probability of transitioning to a future state depends only on the current state (and the environment), and is independent of the past history.
In a Markov chain, the system evolves through a series of discrete time steps, moving from one state to another. The key components of a Markov chain are:
States: A set of possible states that the system can occupy. Each state represents a distinct condition or configuration.
Transition probabilities: For each pair of states, there is a probability associated with transitioning from one state to another in the next time step. These transition probabilities are often represented using a transition matrix or transition probability function.
Initial state probability distribution: The probability distribution that determines the initial state of the system. It represents the likelihood of starting from each possible state.
Markov chains are often referred to as "memoryless" systems
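The memoryless property can be seen in a tiny simulation. The two-state weather chain below is an assumed example: each next state is sampled from the current state's transition row alone, never from the history.

```python
import random

# assumed two-state weather chain: transition[state][next_state] = probability
transition = {
    'sunny': {'sunny': 0.9, 'rainy': 0.1},
    'rainy': {'sunny': 0.5, 'rainy': 0.5},
}

def step(state):
    # sample the next state from the current state's row only
    r, acc = random.random(), 0.0
    for next_state, p in transition[state].items():
        acc += p
        if r < acc:
            return next_state
    return next_state  # guard against floating-point rounding

random.seed(0)
chain = ['sunny']
for _ in range(10):
    chain.append(step(chain[-1]))
```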
A Markov Decision Process (MDP) extends a Markov chain with actions and rewards. Its key components are:
States: A set of possible states in the system. Each state represents a specific condition or situation in which the agent can find itself.
Actions: A set of possible actions that the agent can take in each state. The agent selects an action based on its current state.
Transition Probabilities: For each state-action pair, there is a probability distribution that defines the likelihood of transitioning from the current state to a subsequent state based on the chosen action.
Rewards: A numerical reward associated with transitioning from one state to another due to the chosen action. The reward captures the desirability or value of reaching a particular state.
Discount Factor: A discount factor between 0 and 1, denoted as γ, which determines the importance of future rewards. It influences the agent's preference for immediate rewards versus long-term rewards.
Initial state probability distribution: The probability distribution that determines the initial state of the system. It represents the likelihood of starting from each possible state.
Putting this together, a tabular every-visit Monte Carlo method for estimating Q-values (assuming a Gym-style `env` with `reset()` and `step(action)`):

import random
import numpy as np
from collections import defaultdict

N_ACTIONS = 4  # assumed size of the action space

# q_table maps state -> list of estimated action values:
# q_table = { state: [a1_v, a2_v, a3_v, ...] }
q_table = defaultdict(lambda: [0.0] * N_ACTIONS)

TRAIN, EVALUATE = 'train', 'evaluate'
mode = TRAIN

def policy(s):
    # epsilon-greedy while training, greedy while evaluating
    if mode == TRAIN:
        epsilon = 0.05
        if random.random() < epsilon:
            return random.choice(range(len(q_table[s])))
        return int(np.argmax(q_table[s]))
    elif mode == EVALUATE:
        return int(np.argmax(q_table[s]))

gamma = 0.99
returns_sum = defaultdict(float)  # sum of returns seen for each (s, a)
returns_count = defaultdict(int)  # visit count for each (s, a)

for _ in range(10000):
    state = env.reset()  # initial state; reset before each episode
    done = False
    trajectory = []
    while not done:
        action = policy(state)
        next_state, reward, done = env.step(action)
        trajectory.append((state, action, reward, next_state))
        state = next_state

    # walk the episode backwards, accumulating the discounted return
    goal = 0
    for _s, _a, _r, _s_n in trajectory[::-1]:
        goal = _r + gamma * goal
        returns_sum[(_s, _a)] += goal
        returns_count[(_s, _a)] += 1

# Monte Carlo estimate: average the returns over all visits of (s, a)
for (_s, _a), total in returns_sum.items():
    q_table[_s][_a] = total / returns_count[(_s, _a)]